Profiling Data

"A method of understanding and justifying data storage expenditures"


It is commonly understood that an enterprise's data is its most valuable asset. However, finding a solution to data storage management problems can be difficult because of the many vendor approaches that are available. These solutions are usually sought to address increasing needs for user primary data storage, file server storage, or backup. The alternatives include more hard disk space, RAID systems, and manual or automatic backup or archiving solutions. The quest for a solution should properly start long before those specific products are looked at - it starts with an analysis of the data in use by the enterprise and the nature of that data.

The management of data storage is becoming more and more expensive, not only in the actual dollar cost of equipment and media, but also in time spent by administrative personnel. When a site uses only conventional backup, expense escalates rapidly as the amount of data increases. The first step toward an automated archiving facility is understanding the difference between dynamic and static data. Only by understanding the data differences can a site administrator provide the most cost-effective and reliable service.

Static Data

Static data changes rarely or not at all. It can be divided into two basic categories: important data and abandoned data. Abandoned data is the product of an intermediate computing step, or output which, after initial use, is no longer needed; in either case it can be discarded.

Separating important static data from dynamic and abandoned data defines which data should be placed on a secure media. It also allows multiple copies to be kept of only the important data, saving space and expenditures. This also minimizes the bulk of the off-site copy stored as part of a disaster control plan.

A file that has not been modified recently can be considered a static file; a file that has not even been accessed might be considered an abandoned file. Important static data that is neither modified nor abandoned includes data such as released engineering drawings, fingerprints, and medical images. Similarly, important static files that are rarely accessed might be those kept for regulatory reasons, like financial records or aircraft maintenance records. The criteria for defining static, abandoned, and dynamic data vary by site; establishing them is often the most important step in identifying areas for improved efficiency.
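
As a rough sketch of how such criteria might be applied, the following Python fragment classifies a file as dynamic, static, or abandoned by its modification and access times. The 90- and 180-day thresholds are purely illustrative assumptions; each site would substitute its own policy.

    import os
    import time

    # Illustrative thresholds only; every site chooses its own policy.
    MODIFY_THRESHOLD_DAYS = 90    # not modified in 90 days  -> static
    ACCESS_THRESHOLD_DAYS = 180   # not accessed in 180 days -> possibly abandoned

    def classify(path, now=None):
        """Classify one file as dynamic, static, or abandoned by age."""
        now = now or time.time()
        st = os.stat(path)
        days_since_modify = (now - st.st_mtime) / 86400
        days_since_access = (now - st.st_atime) / 86400
        if days_since_modify < MODIFY_THRESHOLD_DAYS:
            return "dynamic"      # working data; keep on fast disk
        if days_since_access >= ACCESS_THRESHOLD_DAYS:
            return "abandoned"    # review, then discard
        return "static"           # candidate for secure archival media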

Static application data

Application files which are static must always be available on magnetic disk because they are frequently accessed. However, as applications are distributed around the system, the same files often reside on every workstation on the network. On-line manuals are a prime example of static data: not necessarily accessed often, but definitely needed on-line. An analysis of how many copies of on-line manuals are repeated on each workstation (250MB repeated on only 10 workstations equals 2.5 gigabytes) can be enlightening. A great deal of on-line storage can be freed by eliminating duplicate sets of data. Perhaps equally important, this static data should not be backed up on each workstation week after week.
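
One way to quantify such duplication is to checksum the candidate directories and total the space consumed by repeated copies. The Python sketch below is illustrative only, and the workstation mount paths in the usage comment are hypothetical.

    import hashlib
    import os
    from collections import defaultdict

    def checksum(path, chunk=1 << 20):
        """Return the SHA-256 digest of a file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def duplicated_bytes(roots):
        """Total bytes consumed by second and later copies of identical files."""
        groups = defaultdict(list)
        for root in roots:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    groups[checksum(path)].append(path)
        return sum(os.path.getsize(paths[0]) * (len(paths) - 1)
                   for paths in groups.values() if len(paths) > 1)

    # Hypothetical usage: man-page trees mounted from ten workstations.
    # print(duplicated_bytes(["/net/ws%02d/usr/man" % i for i in range(1, 11)]))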


Dynamic Data

Dynamic data is working data. The media selected for it should be read/write, since the data will be reused. Frequently this media will differ from the secure media noted above, because dynamic data is accessed and updated often (usually it resides on fast hard disk). As this data ages, it may be reclassified as important or abandoned and then "migrated" to another media. This is the principle behind hierarchical storage management (HSM) - the different media form a "hierarchy" according to cost, security, speed, or other characteristics.
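
A minimal sketch of the migration step, assuming a hypothetical /archive mount point for the slower, cheaper tier: files untouched for a site-chosen number of days are moved off fast disk. A production HSM would do this transparently and leave a stub behind so the file still appears on-line.

    import os
    import shutil
    import time

    ARCHIVE_ROOT = "/archive"    # hypothetical slower, cheaper tier
    AGE_LIMIT_DAYS = 120         # hypothetical site policy

    def migrate_aged(root, now=None):
        """Move files untouched for AGE_LIMIT_DAYS to the archive tier."""
        now = now or time.time()
        for dirpath, _, names in os.walk(root):
            for name in names:
                src = os.path.join(dirpath, name)
                if (now - os.stat(src).st_atime) / 86400 >= AGE_LIMIT_DAYS:
                    dst = os.path.join(ARCHIVE_ROOT, os.path.relpath(src, root))
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    # A real HSM would leave a stub here so the file
                    # still appears on-line to users and applications.
                    shutil.move(src, dst)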

Great savings in hard disk space, along with increased data integrity, can be achieved if data storage is managed according to its importance to the enterprise.

Site Analysis

As mentioned above, the data at computational facilities can be classified into two major categories: static and dynamic. Many facilities do not separate these types of data; most are not even aware of which data in the organization falls into each category. In most cases all the data, whether static or dynamic, is backed up. In most networks, only 15 to 25 percent of the overall data is dynamic. This means that up to 85% of each backup is the same data, week after week. Likewise, backups recorded to media in a robotic unit can create many copies of the same data. This highlights the need for the site administrator to profile the data.
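
One hedged way to obtain such a profile is simply to walk the file systems and total what was touched within a window. The Python sketch below uses the same 30-day window as the example later in this note.

    import os
    import time

    def profile(root, window_days=30):
        """Report the share of files and bytes modified or accessed recently."""
        now = time.time()
        cutoff = window_days * 86400
        n_files = n_bytes = 0
        mod_files = mod_bytes = acc_files = acc_bytes = 0
        for dirpath, _, names in os.walk(root):
            for name in names:
                st = os.stat(os.path.join(dirpath, name))
                n_files += 1
                n_bytes += st.st_size
                if now - st.st_mtime <= cutoff:
                    mod_files += 1
                    mod_bytes += st.st_size
                if now - st.st_atime <= cutoff:
                    acc_files += 1
                    acc_bytes += st.st_size
        if n_files and n_bytes:
            print(f"modified: {mod_files / n_files:.1%} of files, "
                  f"{mod_bytes / n_bytes:.1%} of data")
            print(f"accessed: {acc_files / n_files:.1%} of files, "
                  f"{acc_bytes / n_bytes:.1%} of data")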

Static data is an obvious candidate for an automated archiving facility, because it can be removed from the daily backup process. Dynamic data remains a candidate for regular backups. Managed correctly, this data can also be selectively removed from the backup process according to usage patterns or life span requirements. This management process greatly reduces the load on backup systems and forms the foundation of a disaster recovery system. Another way to describe the difference is that backup is simply multiple copies of the same data; an automated archiving facility provides version control, managing not only the "archived" copy but also the "on-line" copy. This can include maintaining a number of versions of the data, while ensuring that only the newest is modified.

Retrieval is different for backed-up data and data stored in an automated archive environment. The user or administrator must explicitly request the retrieval of a file from the backups. Movement of that retrieved copy must be managed manually - verifying that it is the latest version and that no other versions exist on-line. For this reason, backup always involves (and should involve) human intervention. It is, at best, inefficient and, at worst, can result in corrupted data. A file in an automated archive environment can be retrieved automatically, with no human intervention other than requesting the file. Version control is handled automatically, through the hierarchical storage manager and the file-locking controls of the operating system. In fact, the user does not even need to know where the file actually resides within the archive hierarchy. An automated archiving facility makes archived data appear to be on-line to the user.

Recording Format

For a complete analysis, data should be profiled over a period of several months to account for vacations, projects ending and starting, and so on. The administrator should also be aware of the archive format for each type of data. Critical data preserved longer than two or three years should not be stored in a notation linked to a particular CPU architecture or vendor, because the data will almost certainly outlive the hardware technology. Users should be made aware of this problem, because there is a significant cost involved in converting data to new platforms.
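
Such multi-month profiling can be automated by capturing the same counts on a schedule and accumulating them for trend analysis. A minimal sketch, assuming a hypothetical log location; it could be run from cron once a week or month.

    import csv
    import datetime
    import os
    import time

    LOG = "/var/tmp/profile_history.csv"    # hypothetical running-log location

    def snapshot(root, window_days=30):
        """Append one dated row of modify/access counts to a running log."""
        now = time.time()
        cutoff = window_days * 86400
        total = modified = accessed = 0
        for dirpath, _, names in os.walk(root):
            for name in names:
                st = os.stat(os.path.join(dirpath, name))
                total += 1
                modified += now - st.st_mtime <= cutoff
                accessed += now - st.st_atime <= cutoff
        with open(LOG, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.date.today().isoformat(), total, modified, accessed])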

Proprietary storage formats are an even bigger problem for the site. Many sites have purchased "open" architectures, and they feel confident in their ability to move forward in an independent manner. But many of these sites have not asked the important questions: How is my data stored on the media? Is the format open? Are ANSI standard labels recorded? How are the files recorded, and who can read this data? If only the recording vendor's product can read the data, the site is totally dependent on that vendor - not an open solution - because data conversions would have to be performed for any change in vendor or operating platform. This can become an integrity problem if the vendor no longer supports a particular product or platform.

An Example

To illustrate the points in this document, the following is data from an actual site. A large computation development facility had 3 million files, of which only 7.7% had been modified within 30 days; these amounted to only 9.5% of the 1.8 terabytes (a terabyte is a million megabytes) of data on the system. Only 14.1% of the 3 million files had been accessed within 30 days, which amounted to 13.1% of the data.

This particular site used conventional backup methods (i.e., incremental backups every day with full backups every weekend). Over the span of a month including 4 weekends, 1.6 terabytes (0.905 x 1.8 terabytes = 1.6 terabytes) of the same data was repeated 4 times in its backups. This is equivalent to 4,000 3490 tapes (at 400 megabytes per tape) x 4, or 16,000 tapes holding the same data.
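
The arithmetic can be checked directly; the short reproduction below rounds less aggressively than the figures above (which round to 4,000 and 16,000 tapes).

    # Reproducing the site's arithmetic from the example above.
    total_data_tb = 1.8           # terabytes on the system
    static_fraction = 1 - 0.095   # 9.5% modified within 30 days -> 90.5% static
    tape_capacity_mb = 400        # 3490 cartridge capacity
    full_backups = 4              # four weekends in the month

    static_tb = static_fraction * total_data_tb                  # ~1.6 TB unchanged
    tapes_per_full = static_tb * 1_000_000 / tape_capacity_mb    # ~4,000 tapes
    print(f"{static_tb:.2f} TB static data, "
          f"{tapes_per_full * full_backups:,.0f} tapes of repeated data per month")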

Summary

As an exercise, analyzing the static and dynamic data at a site can identify areas of waste and can suggest whether more disk, more backup, and/or HSM is the right solution. In addition, the exercise can yield some further benefits, such as forcing site administrators to face questions like:

  1. Is the data safeguarded adequately?
  2. Is the site making the most efficient and cost-effective use of media?
  3. Does the site have the data administration process under control?
  4. Does the enterprise know where its most valuable data resides?



(C)1994, LSC, Inc. All rights reserved.
Storage and Archiving Manager (SAM-FS) and Fast File Recovery System are trademarks of LSC, Inc. All other trademarks are the property of their respective owners.


For more information, send us mail:

inform@lsci.com.